Developing a Persian Part of Speech Tagger

Author

  • Karine Megerdoomian
Abstract

Assigning grammatical categories to words in a text is an important component of a natural language processing (NLP) system. Corpora tagged with part-of-speech (POS) information are often used as a prerequisite for more complex NLP applications such as information extraction, syntactic parsing, machine translation, or semantic field annotation. They are also used to help train statistical models. Prior to tagging, a natural language processing system generally requires modules for segmenting tokens in the text and providing a morphological analysis. The actual annotation scheme used, however, is often motivated by the system application. This paper outlines some of the main challenges that arise in the development of a Persian POS tagger – such as encoding issues, long-distance dependencies in morphology, recognition of complex tokens, word and phrasal boundaries, and analysis of multiword expressions – and proposes approaches to resolving these issues.


Similar resources

A Statistical Part-of-Speech Tagger for Persian

This paper presents the statistical part-of-speech tagger HunPoS trained on a Persian corpus. The experimental results show that HunPoS achieves an overall accuracy of 96.9%, which is the best result reported for Persian part-of-speech tagging.


A Persian Part-Of-Speech Tagger Based on Morphological Analysis

This paper describes a method based on morphological analysis of words for a Persian part-of-speech (POS) tagging system. It is a central part of a process for expanding a large Persian corpus called Peykare (or Textual Corpus of Persian Language). Peykare is divided into two parts: annotated and unannotated. We use the annotated part in order to create an automatic morphological analyze...


Studying impressive parameters on the performance of Persian probabilistic context free grammar parser

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The exploitation of treebank data has been important ever since the first large-scale treebank, the Penn Treebank, was published. Although treebanks originated in computational linguistics, their value is becoming more widely appreciated in linguistics research as a whole. F...


Creating a Feasible Corpus for Persian POS Tagging

This paper describes the creation of a test collection for Persian part-of-speech tagging experiments. This collection was created by modifying a manually part-of-speech (POS) tagged Persian corpus with over two million tagged words. The original collection had a tag set of 550 tags, which is more than any machine learning algorithm can handle. The number of tags for these experiments was reduc...


A Part-of-Speech Tagging System for the Persian Language (سیستم برچسب گذاری اجزای واژگانی کلام در زبان فارسی)

Part-of-speech (POS) tagging is an essential step for many models and methods in other areas of natural language processing, such as machine translation, spell checking, text-to-speech, and automatic speech recognition. So far, highly accurate POS taggers have been created for many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...




Publication date: 2005